Charlottesville Bus Ridership Analysis

Intro

Public transportation is a crucial component of Charlottesville’s urban infrastructure. It’s associated with social mobility, urban accessbility, and economic development.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.1     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(leaflet)
library(lubridate)
library(sf)
## Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
library(leaflet.extras)
library(readr)
library(htmlwidgets)
df <- read_csv("~/Downloads/Transit_2020.csv")
## Rows: 215407 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Stop, Route, Date_Time, FareCategory, PaymentType
## dbl (5): TransitID, Count, Fare, Latitude, Longitude
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Checking the FareCategory

df |> group_by(FareCategory) |> summarise(n=n()) |> arrange(desc(n))
## # A tibble: 30 × 2
##    FareCategory          n
##    <chr>             <int>
##  1 Trolley Free Ride 50921
##  2 30 Day Full       32557
##  3 24 Hour Full      29480
##  4 UVA Health        25952
##  5 UVA Academic      25365
##  6 One Ride Full     12239
##  7 ADA               11535
##  8 30 Day Reduced     7856
##  9 Youth              5960
## 10 24 Hour Reduced    3045
## # ℹ 20 more rows

I imported the data set from “City of Charlottesville Open Data”, a data set website by the Charlottesville government to provide access to the general public about the municipal information. The data set I chose took data from the two bus systems, The CAT (Charlottesville Area Transit) and the UVA transit system(gold lines, green lines, orange lines, silver lines). It collects the stops that the buses make with the stops’ coordinates and time in the day.

First Research Question

My first research question was “What’s the busiest hour of a weekday?” I first filtered out the data from Monday to Friday. Then, I grouped the data by the column “Hour” and computed the sum of the stops for every hour. At last, I created a bar chart using ggplot with theme “classic” because I want the bar chart to be as neat as possible.

df$FareCategory = ifelse(grepl("UVA", df$FareCategory), "UVA", df$FareCategory)
df <- df %>%
  filter(FareCategory %in% c("UVA", "Trolley Free Ride"))
df$FareCategory <- ifelse(grepl("UVA", df$FareCategory), "UVA", "Non-UVA")

# Step 2: Classify stops based on whether they EVER had a UVA fare
stop_type <- df %>%
  group_by(Stop) %>%
  summarise(StopType = ifelse(all(FareCategory == "UVA"), "UVA Stop", "Non-UVA Stop"))

# Step 3: Join back into original data
df <- df %>%
  left_join(stop_type, by = "Stop")

# Step 2: Create time features
df <- df %>%
  mutate(Date_Time = ymd_hms(Date_Time),
         Hour = hour(Date_Time)-4,
         Date = as.Date(Date_Time))

df$Hour <- ifelse(df$Hour < 0, df$Hour + 24, df$Hour)


# Step 3: Filter to weekdays
df_weekdays <- df %>%
  filter(wday(Date_Time) %in% 2:6)

# Step 4: Total ridership per hour per day per StopType
hourly_totals <- df_weekdays %>%
  group_by(StopType, Date, Hour) %>%
  summarise(Daily_Total_Ridership = sum(Count, na.rm = TRUE), .groups = "drop")

# Step 5: Average ridership per hour per StopType
hourly_avg <- hourly_totals %>%
  group_by(StopType, Hour) %>%
  summarise(Average_Ridership = mean(Daily_Total_Ridership, na.rm = TRUE), .groups = "drop")

ggplot(filter(hourly_avg, StopType == "UVA Stop"),
       aes(x = Hour, y = Average_Ridership)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Weekday Total Ridership per Hour – UVA Stops",
       x = "Hour of the Day", y = "Total Ridership per Hour") +
  theme_classic() +
  scale_x_continuous(breaks = seq(0, 23, 1))

ggplot(filter(hourly_avg, StopType == "Non-UVA Stop"),
       aes(x = Hour, y = Average_Ridership)) +
  geom_bar(stat = "identity", fill = "firebrick") +
  labs(title = "Weekday Average Ridership per Hour – Non-UVA Stops",
       x = "Hour of the Day", y = "Average Ridership per Hour") +
  theme_classic() +
  scale_x_continuous(breaks = seq(0, 23, 1))

# Get UVA and Non-UVA ridership vectors
uva_ridership <- filter(hourly_avg, StopType == "UVA Stop")$Average_Ridership
non_uva_ridership <- filter(hourly_avg, StopType == "Non-UVA Stop")$Average_Ridership

# Perform Welch Two Sample t-test (assumes unequal variances)
t.test(uva_ridership, non_uva_ridership, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  uva_ridership and non_uva_ridership
## t = -4.2166, df = 23.522, p-value = 0.0003154
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -69.82261 -23.89890
## sample estimates:
## mean of x mean of y 
##  32.33412  79.19488

Second Research Question

My second research question was “What is the busiest area?” I first downloaded the Charlottesville region map from the Charlottesville open data. Then I used the leaflet package, a widget that can be rendered on HTML pages. I imported the Charlottesville map, and used the coordinates information of each stop from the Transit_2020 data set to create a heatmap and a dot density map. In addition, I created a marker using the coordinates of UVA, just to see what’s the impact of UVA population on the stops of the buses.

#classify uva and non-uva
df_uva <- df %>%
  filter(StopType == "UVA Stop") %>%
  select(Latitude, Longitude, Count)
  ##|>na.omit()

df_nonuva <- df %>%
  filter(StopType == "Non-UVA Stop") %>%
  select(Latitude, Longitude, Count)
  ##|>na.omit()

The heat map

charlottesville_boundary <- st_read("~/Downloads/Municipal_Boundary_Area.geojson") 
## Reading layer `Municipal_Boundary_Area' from data source 
##   `/Users/dylanli/Downloads/Municipal_Boundary_Area.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 1 feature and 5 fields
## Geometry type: POLYGON
## Dimension:     XYZ
## Bounding box:  xmin: -78.52377 ymin: 38.00968 xmax: -78.44636 ymax: 38.07053
## z_range:       zmin: 0 zmax: 0
## Geodetic CRS:  WGS 84
charlottesville_boundary <- st_zm(charlottesville_boundary)

uva_lat <- 38.0336
uva_lon <- -78.5070

heat_map <- leaflet(df_uva)|>
  addTiles() |>
  addHeatmap(
    lng = ~Longitude, lat = ~Latitude,
    intensity = ~Count, 
    blur = 20, radius = 15, max = max(df_uva$Count, na.rm = TRUE)
)|>
  addPolygons(
    data = charlottesville_boundary,
    color = "blue", weight = 2, fillOpacity = 0.1,
    popup = "Charlottesville City Boundary"
  ) |>
  addMarkers(
    lng = uva_lon, lat = uva_lat,
    popup = "University of Virginia",
    label = "University of Virginia",
    labelOptions = labelOptions(noHide = TRUE, direction = "top", textOnly = TRUE)
  ) |>
  setView(lng = -78.5070, lat = 38.0336, zoom = 13)
# Save the interactive map as an HTML file
saveWidget(heat_map, "heat_map.html")

library(webshot)

# Capture the HTML map as a PNG image
webshot("heat_map.html", file = "heat_map.png")

charlottesville_boundary <- st_read("~/Downloads/Municipal_Boundary_Area.geojson") 
## Reading layer `Municipal_Boundary_Area' from data source 
##   `/Users/dylanli/Downloads/Municipal_Boundary_Area.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 1 feature and 5 fields
## Geometry type: POLYGON
## Dimension:     XYZ
## Bounding box:  xmin: -78.52377 ymin: 38.00968 xmax: -78.44636 ymax: 38.07053
## z_range:       zmin: 0 zmax: 0
## Geodetic CRS:  WGS 84
charlottesville_boundary <- st_zm(charlottesville_boundary)

uva_lat <- 38.0336
uva_lon <- -78.5070

heat_map <- leaflet(df_nonuva)|>
  addTiles() |>
  addHeatmap(
    lng = ~Longitude, lat = ~Latitude,
    intensity = ~Count, 
    blur = 20, radius = 15, max = max(df_nonuva$Count, na.rm = TRUE)
)|>
  addPolygons(
    data = charlottesville_boundary,
    color = "blue", weight = 2, fillOpacity = 0.1,
    popup = "Charlottesville City Boundary"
  ) |>
  addMarkers(
    lng = uva_lon, lat = uva_lat,
    popup = "University of Virginia",
    label = "University of Virginia",
    labelOptions = labelOptions(noHide = TRUE, direction = "top", textOnly = TRUE)
  ) |>
  setView(lng = -78.5070, lat = 38.0336, zoom = 13)
heat_map
library(leaflet)
library(htmlwidgets)
library(sf)

# Reading the boundary geojson file
charlottesville_boundary <- st_read("~/Downloads/Municipal_Boundary_Area.geojson") 
charlottesville_boundary <- st_zm(charlottesville_boundary)

# Coordinates for University of Virginia
uva_lat <- 38.0336
uva_lon <- -78.5070

# Generate the leaflet map
heat_map <- leaflet(df_afternoon) |>
  addTiles() |>
  addHeatmap(
    lng = ~Longitude, lat = ~Latitude,
    intensity = ~Count, 
    blur = 20, radius = 15, max = max(df_afternoon$Count, na.rm = TRUE)
  ) |>
  addPolygons(
    data = charlottesville_boundary,
    color = "blue", weight = 2, fillOpacity = 0.1,
    popup = "Charlottesville City Boundary"
  ) |>
  addMarkers(
    lng = uva_lon, lat = uva_lat,
    popup = "University of Virginia",
    label = "University of Virginia",
    labelOptions = labelOptions(noHide = TRUE, direction = "top", textOnly = TRUE)
  ) |>
  setView(lng = -78.5070, lat = 38.0336, zoom = 13)

# Save the interactive map as an HTML file
saveWidget(heat_map, "heat_map.html")

library(webshot)

# Capture the HTML map as a PNG image
webshot("heat_map.html", file = "heat_map.png")

The dot density map

library(htmlwidgets)

charlottesville_boundary <- st_read("~/Downloads/Municipal_Boundary_Area.geojson") 
## Reading layer `Municipal_Boundary_Area' from data source 
##   `/Users/dylanli/Downloads/Municipal_Boundary_Area.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 1 feature and 5 fields
## Geometry type: POLYGON
## Dimension:     XYZ
## Bounding box:  xmin: -78.52377 ymin: 38.00968 xmax: -78.44636 ymax: 38.07053
## z_range:       zmin: 0 zmax: 0
## Geodetic CRS:  WGS 84
charlottesville_boundary <- st_zm(charlottesville_boundary)

uva_lat <- 38.0336
uva_lon <- -78.5070

circle_map <- leaflet(df_uva) |>
  addTiles() |>
  addCircles(
    lng = ~Longitude, lat = ~Latitude,
    radius = ~Count * 10,
    weight = 1, color = "red", fillOpacity = 0.5,
    popup = ~paste("Ridership Count:", Count)
  ) |>
  addPolygons(
    data = charlottesville_boundary,
    color = "blue", weight = 2, fillOpacity = 0.1,
    popup = "Charlottesville City Boundary"
  ) |>
  addMarkers(
    lng = uva_lon, lat = uva_lat,
    popup = "University of Virginia",
    label = "University of Virginia",
    labelOptions = labelOptions(noHide = TRUE, direction = "top", textOnly = TRUE)
  ) |>
  setView(lng = -78.5070, lat = 38.0336, zoom = 13)

saveWidget(circle_map, "dot_dens_map.html",selfcontained = TRUE)

library(webshot)

# Capture the HTML map as a PNG image
webshot("dot_dens_map.html", file = "dot_dens_map.png")

The heatmap shows that UVA is a transit hub, as the hottest (reddest) coverage are concentrated around UVA. The dot density map provides a more detailed route of the buses. We can see that the main bus line goes through the downtown corridor and the neighborhoods in the north part of Charlottesville.

After a comparison with the poverty distribution in Charlottesville, we can see that there are in general less bus coverage in poorer neighborhoods such as Woolen Mills, Ridge Street, and Belmont. People who live there need more public transportation, but might not be able to do so due to the lack of infrastructure, funding, and the historical reason of gentrification.

Conclusion

In general, this study provides information on the busiest hour and region. It helps me realize the existing flaws of the bus system as the poor neighborhoods were not fully covered. I hope the research can demonstrate to people what the transit system looks like and bring more attention to the vulnerable community.

Stats Test

prop.test(c(52126,50921),c(103047,103047))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(52126, 50921) out of c(103047, 103047)
## X-squared = 28.135, df = 1, p-value = 1.131e-07
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.007366952 0.016020434
## sample estimates:
##    prop 1    prop 2 
## 0.5058468 0.4941532
#P(UVA)-P(Trolley) = [0.51,0.49]